Text Categorization of Documents using K-Means and K-Means++ Clustering Algorithm

Authors: Aditi Anand Shetkar, Sonia Fernandes
Publication venue: 'Auricle Technologies, Pvt., Ltd.'
Publication date: 30/06/2016

Text categorization is the task of sorting a set of documents into categories from a predefined set. It supports better management of text documents and simplifies their retrieval. Clustering is an unsupervised learning technique that groups a set of objects into clusters; text document clustering groups related text documents based on their content. Various clustering algorithms are available for text categorization. This paper categorizes text documents using two clustering algorithms, K-means and K-means++, and compares them to determine which is better suited to the task. The work also includes a pre-processing phase consisting of tokenization, stop-word removal and stemming, followed by Tf-Idf calculation. In addition, the impact of three distance/similarity measures (Cosine Similarity, Jaccard coefficient, Euclidean distance) on the results of both clustering algorithms is evaluated. The dataset used for evaluation consists of 600 text documents in three categories: Festivals, Sports and Tourism in India. Our observations show that K-means++ with the Cosine Similarity measure gives better results than K-means. For K-means++ with the Cosine Similarity measure, the purity of the clusters obtained is 0.8216.
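
The pipeline described in the abstract (pre-processing, Tf-Idf weighting, K-means with either random or K-means++ seeding, and purity as the evaluation score) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every function name, the tiny stop-word list, the smoothed Tf-Idf formula and the sample documents are assumptions made for the example. Stemming is omitted, and only the cosine measure is shown, not the Jaccard coefficient or Euclidean distance.

```python
import math
import random
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real list is much longer).
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "and", "to", "for", "on"}

def tokenize(text):
    """Lowercase, split on non-letters, drop stop words (no stemming here)."""
    words = "".join(c if c.isalpha() else " " for c in text.lower()).split()
    return [w for w in words if w not in STOP_WORDS]

def tfidf(docs):
    """Return one sparse {term: weight} dict per document (smoothed idf)."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def kmeanspp_seeds(vectors, k, rng):
    """K-means++ seeding: pick each new seed with probability proportional
    to its squared cosine distance from the nearest seed chosen so far."""
    seeds = [rng.choice(vectors)]
    while len(seeds) < k:
        d2 = [min(1.0 - cosine(v, s) for s in seeds) ** 2 for v in vectors]
        seeds.append(rng.choices(vectors, weights=d2, k=1)[0])
    return seeds

def centroid(members):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in members:
        total.update(v)  # adds the float weights term by term
    return {t: w / len(members) for t, w in total.items()}

def kmeans(vectors, k, plus_plus=True, iters=20, seed=0):
    """Cluster by cosine similarity; returns one cluster label per vector."""
    rng = random.Random(seed)
    centers = kmeanspp_seeds(vectors, k, rng) if plus_plus else rng.sample(vectors, k)
    labels = []
    for _ in range(iters):
        new_labels = [max(range(k), key=lambda j: cosine(v, centers[j]))
                      for v in vectors]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, l in zip(vectors, labels) if l == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = centroid(members)
    return labels

def purity(labels, truth):
    """Fraction of documents that belong to their cluster's majority class."""
    total = 0
    for j in set(labels):
        classes = Counter(t for l, t in zip(labels, truth) if l == j)
        total += classes.most_common(1)[0][1]
    return total / len(labels)

if __name__ == "__main__":
    # Made-up sample documents and categories, for illustration only.
    docs = ["Cricket match won by the home team", "The football team won the match",
            "Visit the temple and the beach", "A beach tour and a temple visit"]
    truth = ["Sports", "Sports", "Tourism", "Tourism"]
    vectors = tfidf(docs)
    for plus_plus in (False, True):
        labels = kmeans(vectors, k=2, plus_plus=plus_plus)
        print("k-means++" if plus_plus else "k-means", purity(labels, truth))
```

The only difference between the two variants in this sketch is the choice of initial centers: plain K-means samples them uniformly at random, while K-means++ spreads them out, which tends to make the result less sensitive to a poor starting point. Purity ranges from the share of the largest class (a single mixed cluster) up to 1.0 (every cluster contains a single category).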

International Journal on Recent and Innovation Trends in Computing and Communication